ABSTRACT

Massive graphs, such as online social networks and communication
networks, have become common today. To efficiently analyze such
large graphs, many distributed graph computing systems have been
developed. These systems employ the “think like a vertex”
programming paradigm, where a program proceeds in iterations and at
each iteration, vertices exchange messages with each other.
However, using Pregel’s simple message passing mechanism, some
vertices may send/receive significantly more messages than others
due to either the high degree of these vertices or the logic of the
algorithm used. This forms the communication bottleneck and leads to
imbalanced workload among machines in the cluster. In this paper,
we propose two effective message reduction techniques: (1) vertex
mirroring with message combining, and (2) an additional
request-respond API. These techniques not only reduce the total
number of messages exchanged through the network, but also bound
the number of messages sent/received by any single vertex. We
theoretically analyze the effectiveness of our techniques, and
implement them on top of our open-source Pregel implementation
called Pregel+. Our experiments on various large real graphs
demonstrate that our message reduction techniques significantly
improve the performance of distributed graph computation.

Categories and Subject Descriptors 

D.4.7 [Organization and Design]: Distributed systems 

General Terms 

Performance 
1. INTRODUCTION

With the growing interest in analyzing large real-world graphs
such as online social networks, web graphs and semantic web graphs,
many distributed graph computing systems [1, 5, 10, 11, 13, 18, 21,
23] have emerged. These systems are deployed in a shared-nothing
distributed computing infrastructure usually built on top of a
cluster of low-cost commodity PCs. Pioneered by Google’s Pregel
[13], these systems adopt a vertex-centric computing paradigm,
where programmers think naturally like a vertex when designing
distributed graph algorithms. A Pregel-like system also takes care
of fault recovery and scales to arbitrary cluster size without the
need to change the program code, both of which are indispensable
properties for programs running in a cloud environment.

MapReduce [3], and its open-source implementation Hadoop,
are also popularly used for large scale graph processing. However,
many graph algorithms are intrinsically iterative, such as the
computation of PageRank, connected components, and shortest paths.
For iterative graph computation, a Pregel program is much more
efficient than its MapReduce counterpart [13].

Weaknesses of Pregel. Although Pregel’s vertex-centric computing
model has been widely adopted in most of the recent distributed
graph computing systems [1, 11, 10, 18] (and also inspired the
edge-centric model [5]), Pregel’s vertex-to-vertex message passing
mechanism often causes bottlenecks in communication when
processing real-world graphs.

To clarify this point, we first briefly review how Pregel performs
message passing. In Pregel, a vertex v can send messages to
another vertex u if v knows u’s vertex ID. In most cases, v only
sends messages to its neighbors whose IDs are available from v’s
adjacency list. But there also exist Pregel algorithms in which a
vertex v may send messages to another vertex that is not a neighbor
of v [24, 19]. These algorithms usually adopt pointer jumping (or
doubling), a technique that is widely used in designing PRAM
algorithms [22], to bound the number of iterations by
$O(\log |V|)$, where $|V|$ refers to the number of vertices in the
graph.

The problem with Pregel’s message passing mechanism is that
a small number of vertices, which we call bottleneck vertices, may
send/receive many more messages than other vertices. A bottleneck
vertex not only generates heavy communication, but also
significantly increases the workload of the machine in which the
vertex resides, causing highly imbalanced workload among different
machines. Bottleneck vertices are common when using Pregel to
process real-world graphs, mainly due to either (1) high vertex
degree or (2) algorithm logic, which we elaborate on as follows.

We first consider the problem caused by high vertex degree. When
a high-degree vertex sends messages to all its neighbors, it becomes
a bottleneck vertex. Unfortunately, real-world graphs usually have
highly skewed degree distribution, with some vertices having very
high degrees. For example, in the Twitter who-follows-who graph¹,
the maximum degree is over 2.99M while the average degree is
only 35. Similarly, in the BTC dataset used in our experiments,
the maximum degree is over 1.6M while the average degree is only
4.69.

¹ http://law.di.unimi.it/webdata/twitter-2010/

We ran Hash-Min [17, 24], a distributed algorithm for computing
connected components (CCs), on the degree-skewed BTC dataset in
a cluster with 1 master (Worker 0) and 120 slaves (Workers 1-120),
and observed highly imbalanced workload among different workers,
which we describe next. Pregel assigns each vertex to a worker
by hashing the vertex ID regardless of the degree of the vertex. As
a result, each worker holds approximately the same number of
vertices, but the total number of neighbors in the adjacency lists
(i.e., number of edges) varies greatly among different workers. In
the computation of Hash-Min on BTC, we observed an uneven
distribution of edge number among workers, as some workers contain
more high-degree vertices than other workers. Since messages are
sent along the edges, the uneven distribution of edge number also
leads to an uneven distribution of the amount of communication
among different workers. In Figure 1, the taller blue bars indicate
the total number of messages sent by each worker during the entire
computation of Hash-Min, where we observe highly uneven
communication workload among different workers.

Bottleneck vertices may also be generated by program logic. An
example is the S-V algorithm proposed in [24, 22] for computing
CCs, which we will describe in detail in Section 3.4. In S-V, each
vertex v maintains a field $D[v]$ which records the vertex that v
is to communicate with. The field $D[v]$ may be updated at each
iteration as the algorithm proceeds; and when the algorithm
terminates, vertices $v_i$ and $v_j$ are in the same CC iff
$D[v_i] = D[v_j]$. Thus, during the computation, some vertex u may
communicate with many vertices $\{v_1, v_2, \ldots, v_k\}$ in its
CC if $u = D[v_i]$ for $1 \le i \le k$. In this case, u becomes a
bottleneck vertex.

We ran S-V on the USA road network in a cluster with 1 master
(Worker 0) and 60 slaves (Workers 1-60), and observed highly
imbalanced communication workload among different workers. In
Figure 2, the taller blue bars indicate the total number of messages
sent by each worker during the entire computation of S-V, where
we can see that the communication workload is very biased
(especially at Worker 0). We remark that the imbalanced
communication workload is not caused by skewed vertex degree
distribution, since the largest vertex degree of the USA road
network is merely 9. Rather, it is because of the algorithm logic
of S-V. Specifically, since the USA road network is connected, in
the last round of S-V, all vertices v have $D[v]$ equal to Vertex
0, indicating that they all belong to the same CC. Since Vertex 0
is hashed to Worker 0, Worker 0 sends many more messages than the
other workers, as can be observed from Figure 2.

In addition to the two problems mentioned above, Pregel’s message
passing mechanism is also not efficient for processing graphs
with (relatively) high average degree due to the high overall
communication cost. However, many real-world graphs such as social
networks and mobile phone networks have relatively high average
degree, as a person is often connected to at least dozens of people.

Our Solution. In this paper, we solve the problems caused by 
Pregel’s message passing mechanism with two effective message 
reduction techniques. The goals are to (1) mitigate the problem
of imbalanced workload by eliminating bottleneck vertices, and to
(2) reduce the overall number of messages exchanged through the
network. 

The first technique is called mirroring, which is designed to
eliminate bottleneck vertices caused by high vertex degree. The
main idea is to construct mirrors of each high-degree vertex in
different machines, so that messages from a high-degree vertex are
forwarded to its neighbors by its mirrors in local machines. Let
d(v) be the degree of a vertex v and M be the number of machines
in the cluster. Mirroring bounds the number of messages sent by v
each time to min{M, d(v)}. If v is a high-degree vertex, d(v) can
be up to millions, but M is normally only from tens to a few
hundred. We remark that ideas similar to mirroring have been adopted
by existing systems [11, 18], but we find that mirroring a vertex
does not always reduce the number of messages due to Pregel’s use
of message combiner [13]. Hence, we provide a theoretical analysis
on which vertices should be selected for mirroring in Section 5.

Figure 1: Hash-Min on BTC (with/without mirroring)

Figure 2: S-V on USA (with/without request-respond)

In Figure 1, the short red bars indicate the total number of
messages sent by each worker when mirroring is applied to all
vertices with degree at least 100. We can clearly see the big
difference between the uneven blue bars (without mirroring) and the
even-height short red bars (with mirroring). Furthermore, the number
of messages is also significantly reduced by mirroring. We remark
that the algorithm is still the same and mirroring is completely
transparent to users. Mirroring reduces the running time of
Hash-Min on BTC from 26.97 seconds to 9.55 seconds.

The second technique is a new request-respond paradigm. We
extend the basic Pregel framework by an additional request-respond
functionality. A vertex u may request another vertex v for its
attribute a(v), and the requested value will be available in the
next iteration. The request-respond programming paradigm simplifies
the coding of many Pregel algorithms, as otherwise at least three
iterations are required to explicitly code each request and response
process. More importantly, the request-respond paradigm effectively
eliminates the bottleneck vertices resulting from algorithm
logic, by bounding the number of response messages sent by any
vertex to M. Consider the S-V algorithm mentioned earlier, where a
set of k vertices $\{v_1, v_2, \ldots, v_k\}$ with $D[v_i] = u$
require the value of $D[u]$ from u (thus there are k requests and
responses). Under the request-respond paradigm, all the requests
from a machine to the same target vertex are merged into one
request. Therefore, at most min{M, k} requests are needed for the k
vertices and at most min{M, k} responses are sent from u. For large
real-world graphs, k is often orders of magnitude greater than M.
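The min{M, k} bound can be checked with a small counting sketch. The function name `count_requests` and the worker assignment `vertex_id % num_workers` are assumptions made only for illustration; they are not the Pregel+ API or its hash function.

```python
from collections import defaultdict


def count_requests(requesters, target, num_workers):
    """Count requests sent to `target` without and with per-worker merging.

    Hypothetical helper: workers are assigned by `vertex_id % num_workers`.
    """
    naive = 0
    targets_per_worker = defaultdict(set)
    for v in requesters:
        naive += 1  # basic Pregel: one request message per requesting vertex
        # request-respond: a worker keeps one request per distinct target
        targets_per_worker[v % num_workers].add(target)
    merged = sum(len(t) for t in targets_per_worker.values())
    return naive, merged


naive, merged = count_requests(range(1, 1001), target=0, num_workers=8)
print(naive, merged)
```

With k = 1000 requesters spread over M = 8 workers, the merged count cannot exceed 8, and the same number of responses suffices.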

In Figure 2, the short red bars indicate the total number of
messages sent by each worker when the request-respond paradigm is
applied. Again, the skewed message passing represented by the
blue bars is now replaced by the even-height short red bars. In
particular, Vertex 0 now only responds to the requesting workers
instead of all the requesting vertices in the last round, and hence
the highly imbalanced workload caused by Vertex 0 in Worker 0 is
now evened out. The request-respond paradigm reduces the running
time of S-V on the USA road network from 261.9 seconds to
137.7 seconds.
Figure 3: Illustration of combiner


Finally, we remark that our experiments were run in a cluster
without any resource contention, and our optimization techniques
are expected to improve the overall performance of Pregel
algorithms more significantly if they were run in a public data
center, where the network bandwidth is lower and reducing
communication overhead becomes more important.

The rest of the paper is organized as follows. We review existing
parallel graph computing systems, and highlight the differences of
our work from theirs, in Section 2. In Section 3, we describe some
Pregel algorithms for problems that are common in social network
analysis and web analysis. In Section 4, we introduce the basic
communication framework. We present the mirroring technique
and the request-respond functionality in Sections 5 and 6. Finally,
we report the experimental results in Section 7 and conclude the
paper in Section 8.


2. BACKGROUND AND RELATED WORK 

We first review Pregel’s framework, and then discuss other
related distributed graph computing systems.

2.1 Pregel 

Pregel [13] is designed based on the bulk synchronous parallel
(BSP) model. It distributes vertices to different machines in a
cluster, where each vertex v is associated with its adjacency list
(i.e., the set of v’s neighbors). A program in Pregel implements a
user-defined compute() function and proceeds in iterations (called
supersteps). In each superstep, the program calls compute() for
each active vertex. The compute() function performs the
user-specified task for a vertex v, such as processing v’s incoming
messages (sent in the previous superstep), sending messages to
other vertices (to be received in the next superstep), and making v
vote to halt. A halted vertex is reactivated if it receives a
message in a subsequent superstep. The program terminates when all
vertices vote to halt and there is no pending message for the next
superstep.
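The superstep loop described above can be sketched as a single-process simulation. The names `run_pregel` and `propagate_max`, and the convention that compute() returns True to vote to halt, are assumptions of this sketch rather than Pregel's actual interface.

```python
def run_pregel(vertices, compute, max_supersteps=1000):
    """Run compute() superstep by superstep; return the supersteps executed."""
    inbox = {vid: [] for vid in vertices}
    active = set(vertices)              # every vertex starts active
    superstep = 0
    while superstep < max_supersteps:
        # A halted vertex is reactivated when it has pending messages.
        runnable = active | {vid for vid, msgs in inbox.items() if msgs}
        if not runnable:                # all halted and nothing pending
            break
        superstep += 1
        outbox = {vid: [] for vid in vertices}

        def send(target, msg):
            outbox[target].append(msg)  # delivered in the next superstep

        active = set()
        for vid in sorted(runnable):
            voted_to_halt = compute(vid, vertices[vid], inbox[vid],
                                    superstep, send)
            if not voted_to_halt:
                active.add(vid)
        inbox = outbox
    return superstep


def propagate_max(vid, state, messages, superstep, send):
    """Toy compute(): spread the largest value seen, then vote to halt."""
    best = max([state["value"]] + messages)
    if superstep == 1 or best > state["value"]:
        state["value"] = best
        for nbr in state["neighbors"]:
            send(nbr, best)
    return True


graph = {
    1: {"value": 3, "neighbors": [2]},
    2: {"value": 6, "neighbors": [1, 3]},
    3: {"value": 2, "neighbors": [2]},
}
steps = run_pregel(graph, propagate_max)
print(steps, {vid: s["value"] for vid, s in graph.items()})
```

The loop stops exactly when no vertex is active and no message is pending, which mirrors the termination condition stated in the text.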

Pregel numbers the supersteps so that a user may use the current
superstep number when implementing the algorithm logic in
the compute() function. As a result, a Pregel algorithm can
perform different operations in different supersteps by branching
on the current superstep number.

Message Combiner. Pregel allows users to implement a combine()
function, which specifies how to combine messages that are sent
from a machine $M_i$ to the same vertex v in a machine $M_j$. These
messages are combined into a single message, which is then sent
from $M_i$ to v in $M_j$. However, a combiner can be applied only
when commutative and associative operations are to be applied to
the messages. For example, in the PageRank computation, the
messages sent to a vertex v are to be summed up to compute v’s
PageRank value; in this case, we can combine all messages sent from
a machine $M_i$ to the same target vertex in a machine $M_j$ into a
single message that equals their sum. Figure 3 illustrates the idea
of combiner, where the messages sent by vertices in machine $M_1$
to the same target vertex $v_j$ in machine $M_2$ are combined into
their sum before sending.
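A minimal sketch of sender-side combining follows, assuming the messages buffered on one machine for one destination machine are held as (target vertex, value) pairs. The function name `combine_outgoing` is hypothetical.

```python
from collections import defaultdict
from functools import reduce
import operator


def combine_outgoing(messages, combine):
    """Merge messages that share a target vertex into one message each.

    `combine` must be commutative and associative (e.g. sum or min).
    """
    grouped = defaultdict(list)
    for target, value in messages:
        grouped[target].append(value)
    return {target: reduce(combine, values)
            for target, values in grouped.items()}


buffered = [(7, 1), (9, 5), (7, 2), (7, 3)]
print(combine_outgoing(buffered, operator.add))  # sum combiner
print(combine_outgoing(buffered, min))           # min combiner
```

Four buffered messages shrink to two, one per distinct target vertex, regardless of which commutative and associative operation is plugged in.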

Aggregator. Pregel also supports aggregator, which is useful for
global communication. Each vertex can provide a value to an
aggregator in compute() in a superstep. The system aggregates those
values and makes the aggregated result available to all vertices in
the next superstep.

2.2 Pregel-Like Systems in JAVA 

Since Google’s Pregel is proprietary, many open-source Pregel
counterparts have been developed. Most of these systems are
implemented in JAVA, e.g., Giraph [1] and GPS [18]. They read the
graph data from Hadoop’s DFS (HDFS) and write the results to HDFS.
However, since object deletion is handled by JAVA’s Garbage
Collector (GC), if a machine maintains a huge number of vertex/edge
objects in main memory, GC needs to track a lot of objects and the
overhead can severely degrade the system performance. To decrease
the number of objects being maintained, JAVA-based systems maintain
vertices in main memory in their binary representation. For
example, Giraph organizes vertices as main memory pages, where each
page is simply a byte array object that holds the binary
representation of many vertices. As a result, a vertex needs to be
deserialized from the page holding it before calling compute(); and
after compute() completes, the updated vertex needs to be serialized
back to its page. The serialization cost can be high, especially if
the adjacency list is long. To avoid unnecessary serialization
cost, a Pregel-like system should be implemented in a language such
as C/C++, where programmers (who are system developers, not end
users) manage main memory objects themselves. We implemented
our Pregel+ system in C/C++.

GPS [18] supports an optimization called large adjacency list
partitioning (LALP) to handle high-degree vertices, whose idea is
similar to vertex mirroring. However, GPS does not explore the
performance tradeoff between vertex mirroring and message
combining. Instead, it is claimed in [18] that very small
performance difference can be observed whether combiner is used or
not, and thus, GPS simply does not perform sender-side message
combining. Our experiments in Section 7 show that sender-side
message combining significantly reduces the overall running time of
Pregel algorithms, and therefore, both vertex mirroring and message
combining should be used to achieve better performance. As we shall
see in Section 5, vertex mirroring and message combining are two
conflicting message reduction techniques, and a theoretical
analysis on their performance tradeoff is needed in order to devise
a cost model for automatically choosing vertices for mirroring.

2.3 GraphLab and PowerGraph 

GraphLab [11] is another parallel graph computing system that
follows a design different from Pregel. GraphLab supports
asynchronous execution, and adopts a data pulling programming
paradigm. Specifically, each vertex actively pulls data from its
neighbors, rather than passively receiving messages sent/pushed by
its neighbors. This feature is somewhat similar to our
request-respond paradigm, but in GraphLab, the requests can only be
sent to the neighbors. As a result, GraphLab cannot support
parallel graph algorithms where a vertex needs to communicate with
a non-neighbor. Such algorithms are, however, quite popular in
Pregel as they make use of the pointer jumping (or doubling)
technique of PRAM algorithms to bound the number of iterations by
$O(\log |V|)$. Examples include the S-V algorithm for computing
CCs [24] and the Pregel algorithm for computing minimum spanning
forest [19]. These algorithms can benefit significantly from our
request-respond technique. Recently, several studies [8, 12]
reported that GraphLab’s asynchronous execution is generally slower
than its synchronous mode (that simulates Pregel’s model) due to
the high locking/unlocking overhead. Thus, we mainly focus on
Pregel’s computing model in this paper.

GraphLab also builds mirrors for vertices, which are called ghosts. 
However, GraphLab creates mirrors for every vertex regardless of 
its degree, which leads to excessive space consumption. A more 
recent version of GraphLab, called PowerGraph [5], partitions the
graph by edges rather than by vertices. Edge partitioning mitigates
the problem of imbalanced workload as the edges of a high-degree
vertex are handled by multiple workers. Accordingly, a new
edge-centric Gather-Apply-Scatter (GAS) computing model is used
instead of the traditional vertex-centric computing model.

3. PREGEL ALGORITHMS 

In this section, we describe some Pregel algorithms for problems
that are common in social network analysis and web analysis,
which will be used for illustrating important concepts and for
performance evaluation.

We consider fundamental problems such as (1) computing connected
components (or bi-connected components), which is a common
preprocessing step for social network analysis [14, 15];
(2) computing minimum spanning tree (or forest), which is useful in
mining social relationships; and (3) computing PageRank, which
is widely used in ranking web pages [16, 9] and spam detection.

For ease of presentation, we first define the graph notations used
in the paper. Given an undirected graph $G = (V, E)$, we denote
the neighbors of a vertex $v \in V$ by $\Gamma(v)$, and the degree
of v by $d(v) = |\Gamma(v)|$; if G is directed, we denote the
in-neighbors (out-neighbors) of a vertex v by $\Gamma_{in}(v)$
($\Gamma_{out}(v)$), and the in-degree (out-degree) of v by
$d_{in}(v) = |\Gamma_{in}(v)|$ ($d_{out}(v) = |\Gamma_{out}(v)|$).
Each vertex $v \in V$ has a unique integer ID, denoted by id(v). The
diameter of G is denoted by $\delta$.

3.1 Attribute Broadcast 

We first introduce a Pregel algorithm for attribute broadcast.
Given a directed graph G, where each vertex v is associated with
an attribute a(v) and an adjacency list that contains the set of v’s
out-neighbors $\Gamma_{out}(v)$, attribute broadcast constructs a
new adjacency list for each vertex v in G, which is defined as
$\hat{\Gamma}_{out}(v) = \{\langle u, a(u)\rangle \mid u \in \Gamma_{out}(v)\}$.

Put simply, attribute broadcast associates each neighbor u in the
adjacency list of a vertex v with u’s attribute a(u). Attribute
broadcast is very useful in distributed graph computation, and it
is a frequently performed key operation in many Pregel algorithms.
For example, the Pregel algorithm for computing bi-connected
components [24] requires relabeling the ID of each vertex u by its
preorder number in the spanning tree, denoted by pre(u). Attribute
broadcast is used in this case, where a(u) refers to pre(u).

The Pregel algorithm for attribute broadcast consists of 3
supersteps: in superstep 1, each vertex v sends a message
$\langle v \rangle$ to each neighbor $u \in \Gamma_{out}(v)$ to
request a(u); then in superstep 2, each vertex u obtains the
requesters v from the incoming messages, and sends the response
message $\langle u, a(u) \rangle$ to each requester v; finally,
in superstep 3, each vertex v collects the incoming messages to
construct $\hat{\Gamma}_{out}(v)$.
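The three supersteps can be traced with a sequential sketch. It assumes every vertex appears as a key of `out_neighbors`, and it models each superstep's message exchange with plain dictionaries; `attribute_broadcast` is a hypothetical name.

```python
def attribute_broadcast(out_neighbors, attr):
    """Return, for each vertex v, the list of (u, a(u)) over v's out-neighbors."""
    # Superstep 1: each v sends its own ID to every out-neighbor u.
    requests = {u: [] for u in out_neighbors}
    for v, nbrs in out_neighbors.items():
        for u in nbrs:
            requests[u].append(v)
    # Superstep 2: each u answers every requester v with (u, a(u)).
    responses = {v: [] for v in out_neighbors}
    for u, requesters in requests.items():
        for v in requesters:
            responses[v].append((u, attr[u]))
    # Superstep 3: each v collects the responses into its new adjacency list.
    return {v: sorted(msgs) for v, msgs in responses.items()}


out_neighbors = {1: [2, 3], 2: [3], 3: []}
attr = {1: "a", 2: "b", 3: "c"}
print(attribute_broadcast(out_neighbors, attr))
```

Note that vertex 3 answers two separate requests here; a vertex with many in-neighbors answers one request per in-neighbor.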

3.2 PageRank 

Next we present a Pregel algorithm for PageRank computation.
Given a directed web graph $G = (V, E)$, where each vertex (page)
v links to a list of pages $\Gamma_{out}(v)$, the problem is to
compute the PageRank, pr(v), of each vertex $v \in V$.
Figure 4: Forest structure of the S-V algorithm
Figure 5: Key operations of the S-V algorithm


Pregel’s PageRank algorithm [13] works as follows. In superstep 1,
each vertex v initializes $pr(v) = 1/|V|$ and distributes the
value $pr(v)/d_{out}(v)$ to each out-neighbor of v. In superstep i
($i > 1$), each vertex v sums up the received values from its
in-neighbors, denoted by sum, and computes
$pr(v) = 0.15/|V| + 0.85 \times sum$. It then distributes
$pr(v)/d_{out}(v)$ to each of its out-neighbors.
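The update rule can be sketched sequentially as below. Two points are assumptions of this sketch, not statements from the text: the computation stops after a fixed number of supersteps, and vertices without out-neighbors simply send nothing.

```python
def pagerank(out_neighbors, num_supersteps):
    """Simulate the vertex-centric PageRank updates for a fixed superstep count."""
    n = len(out_neighbors)
    pr = {v: 1.0 / n for v in out_neighbors}        # superstep 1: initialize
    for _ in range(num_supersteps - 1):             # supersteps 2, 3, ...
        received = {v: 0.0 for v in out_neighbors}
        for v, nbrs in out_neighbors.items():
            for u in nbrs:
                received[u] += pr[v] / len(nbrs)    # message pr(v)/d_out(v)
        pr = {v: 0.15 / n + 0.85 * received[v] for v in out_neighbors}
    return pr


ranks = pagerank({1: [2, 3], 2: [3], 3: [1]}, num_supersteps=2)
print(ranks)
```

Because the received values are only summed, this is a case where a sum combiner applies to the messages.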

3.3 Hash-Min 

We next present a Pregel algorithm for computing connected
components (CCs) in an undirected graph. We adopt the Hash-Min
algorithm [17, 24]. Given a CC C, let us denote the set of
vertices of C by $V(C)$, and define the ID of C to be
$id(C) = \min\{id(v) : v \in V(C)\}$. We further define the color
of a vertex v as $cc(v) = id(C)$, where $v \in V(C)$. Hash-Min
computes cc(v) for each vertex $v \in V$, and the idea is to
broadcast the smallest vertex ID seen so far by each vertex v,
denoted by min(v). When the algorithm terminates,
$min(v) = cc(v)$ for each vertex $v \in V$.

We now describe the Hash-Min algorithm in the Pregel framework.
In superstep 1, each vertex v sets min(v) to be id(v), broadcasts
min(v) to all its neighbors, and votes to halt. In each subsequent
superstep, each vertex v receives messages from its neighbors; let
min* be the smallest ID received; if min* < min(v), v sets
min(v) = min* and broadcasts min* to its neighbors. All vertices
vote to halt at the end of a superstep. When the process
converges, all vertices have voted to halt and for each vertex v,
we have min(v) = cc(v).
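A sequential sketch of Hash-Min is given below, assuming vertex IDs are integers that double as dictionary keys and that `neighbors` lists each undirected edge in both directions. The function name `hash_min` is hypothetical.

```python
def hash_min(neighbors):
    """Label every vertex with the smallest vertex ID in its component."""
    min_id = {v: v for v in neighbors}
    # Superstep 1: every vertex broadcasts its own ID.
    inbox = {v: [] for v in neighbors}
    for v, nbrs in neighbors.items():
        for u in nbrs:
            inbox[u].append(min_id[v])
    # Later supersteps: only vertices that lowered min(v) broadcast again.
    while any(inbox.values()):
        outbox = {v: [] for v in neighbors}
        for v, msgs in inbox.items():
            if msgs and min(msgs) < min_id[v]:
                min_id[v] = min(msgs)
                for u in neighbors[v]:
                    outbox[u].append(min_id[v])
        inbox = outbox
    return min_id


graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
print(hash_min(graph))
```

Taking `min(msgs)` before comparing is exactly what a min combiner would do on the sender side.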


3.4 The S-V Algorithm 

The Hash-Min algorithm described in Section 3.3 requires
$O(\delta)$ supersteps, which can be slow for computing CCs in
large-diameter graphs. Another Pregel algorithm proposed in [24]
computes CCs in $O(\log |V|)$ supersteps, by adapting
Shiloach-Vishkin’s (S-V) algorithm for the PRAM model [22]. We use
this algorithm to demonstrate how algorithm logic generates a
bottleneck vertex v even if d(v) is small.

In the S-V algorithm, each vertex u maintains a pointer $D[u]$,
which is initialized as u, forming a self-loop as shown in
Figure 4(a). During the computation, vertices are organized into a
forest such that all vertices in a tree belong to the same CC. The
tree definition is relaxed a bit here to allow the tree root w to
have a self-loop, i.e., $D[w] = w$ (see Figures 4(b) and 4(c));
while $D[v]$ of any other vertex v in the tree points to v’s
parent.





The S-V algorithm proceeds in rounds, and in each round, the
pointers are updated in three steps (illustrated in Figure 5):
(1) tree hooking: for each edge (u, v), if u’s parent $w = D[u]$ is
a tree root, hook w as a child of v’s parent $D[v]$, i.e., set
$D[D[u]] = D[v]$; (2) star hooking: for each edge (u, v), if u is
in a star (see Figure 4(c) for an example of a star), hook the star
to v’s tree as in Step (1), i.e., set $D[D[u]] = D[v]$;
(3) shortcutting: for each vertex v, move vertex v and its
descendants closer to the tree root, by hooking v to the parent of
v’s parent, i.e., setting $D[v] = D[D[v]]$. The above three steps
execute in rounds, and the algorithm ends when every vertex is in a
star.
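The sketch below isolates step (3), shortcutting, and applies it repeatedly to a single tree until it becomes a star. It deliberately omits the two hooking steps, so it is not the full S-V algorithm; `shortcut_until_star` is a hypothetical name.

```python
def shortcut_until_star(parent):
    """Apply D[v] = D[D[v]] synchronously until every vertex points to a root."""
    D = dict(parent)
    rounds = 0
    while any(D[D[v]] != D[v] for v in D):
        D = {v: D[D[v]] for v in D}  # all vertices read the old pointers
        rounds += 1
    return D, rounds


# A path 8 -> 7 -> ... -> 1 -> 0, with root 0 pointing to itself.
chain = {0: 0}
chain.update({v: v - 1 for v in range(1, 9)})
D, rounds = shortcut_until_star(chain)
print(D, rounds)
```

Each round roughly halves the depth of every vertex, which is why pointer jumping needs only a logarithmic number of rounds; it also concentrates all pointers on the root.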

Due to the shortcutting operation, the S-V algorithm creates
flattened trees (e.g., stars) with large fan-out towards the end of
the execution. As a result, a vertex w may have many children u
(i.e., $D[u] = w$), and each of these children u requests w for the
value of $D[w]$. This renders w a bottleneck vertex. In particular,
in the last round of the S-V algorithm, all vertices v in a CC C
have $D[v] = id(C)$, and they all send requests to the vertex
$w = id(C)$ for $D[w]$. In the basic Pregel framework, w receives
$|V(C)|$ requests and sends $|V(C)|$ responses, which leads to
skewed workload when $|V(C)|$ is large.

3.5 Minimum Spanning Forest 

The Pregel algorithm proposed by [19] for minimum spanning
forest (MSF) computation is another example that shows how
algorithm logic can generate bottleneck vertices. This algorithm
proceeds in iterations, where each iteration consists of three
steps, which we describe below.

In Step (1), each vertex v picks an edge with the minimum weight. 
The vertices and their picked edges form disjoint subgraphs, each 
of which is a conjoined-tree: two trees with their roots joined by 
a cycle. Figure 6 illustrates the concept of a conjoined-tree, where
the edges are those picked in Step (1). The vertex with the smaller
ID in the cycle of a conjoined-tree is called the supervertex of the
tree (e.g., vertex 5 is the supervertex in Figure 6), and the other
vertices are called the subvertices. 

In Step (2), each vertex finds the supervertex of the
conjoined-tree it belongs to, which is accomplished by pointer
jumping. Specifically, each vertex v maintains a pointer $D[v]$;
suppose that v picks edge (v, u) in Step (1), then the value of
$D[v]$ is initialized as u. Each vertex v then sends a request to
$w = D[v]$ for $D[w]$. Initially, the actual supervertex s (e.g.,
vertex 5 in Figure 6) and its neighbor $s'$ in the cycle (e.g.,
vertex 6 in Figure 6) see that they have sent each other messages
and detect that they are in the cycle. Vertex s then sets itself as
the supervertex (i.e., sets $D[s] = s$) due to $s < s'$, before
responding $D[s] = s$ to the requesters (while $D[s'] = s$ remains
for $s'$ since $s' > s$). For any other vertex v, it receives
response $D[w]$ from $w = D[v]$ and updates $D[v]$ to be $D[w]$.
This process is repeated until convergence, at which point $D[v]$
records the supervertex s for all vertices v.
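Steps (1) and (2) can be traced with a sequential sketch. It assumes distinct edge weights, so that each conjoined-tree has a cycle of exactly two vertices, and it replaces the request and response messages with direct dictionary lookups. The function name `find_supervertices` is hypothetical.

```python
def find_supervertices(edges):
    """Map each vertex to its supervertex after Steps (1) and (2).

    `edges` holds undirected (u, v, weight) triples with distinct weights.
    """
    # Step (1): every vertex picks its minimum-weight incident edge.
    best = {}
    for u, v, w in edges:
        for a, b in ((u, v), (v, u)):
            if a not in best or w < best[a][0]:
                best[a] = (w, b)
    picked = {v: nbr for v, (_, nbr) in best.items()}
    # Step (2): the smaller-ID vertex of each 2-cycle becomes the root.
    D = dict(picked)
    for v, u in picked.items():
        if picked[u] == v and v < u:
            D[v] = v
    # Pointer jumping until every vertex points at its supervertex.
    while any(D[D[v]] != D[v] for v in D):
        D = {v: D[D[v]] for v in D}
    return D


edges = [(1, 2, 1), (2, 3, 2), (3, 4, 5), (4, 5, 3), (5, 6, 4)]
print(find_supervertices(edges))
```

In this example vertices 1 and 2 pick each other, as do 4 and 5, so the supervertices are 1 and 4.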

In Step (3), each vertex v sends a request to each neighbor
$u \in \Gamma(v)$ for its supervertex $D[u]$, and removes edge
(v, u) if $D[v] = D[u]$ (i.e., v and u are in the same
conjoined-tree); v then sends the remaining edges (to vertices in
other conjoined-trees) to the supervertex $D[v]$. After this step,
all subvertices are condensed into their supervertex, which
constructs an adjacency list of edges to the other supervertices
from those edges sent by its subvertices.

We consider an improved version of the above algorithm that
applies the Storing-Edges-At-Subvertices (SEAS) optimization of
[19]. Specifically, instead of having the supervertex merge and
store all cross-tree edges, the SEAS optimization stores the edges
of a supervertex in a distributed fashion among all of its
subvertices. As a result, if a supervertex s is merged into another
supervertex, it has to notify its subvertices of the new
supervertex they belong to. This is accomplished by having each
vertex v send a request to its supervertex $D[v] = s$ for $D[s]$.
Since smaller conjoined-trees are merged into larger ones, a
supervertex s may have many subvertices v towards the end of the
execution, and they all request $D[s]$ from s, rendering s a
bottleneck vertex.

4. BASIC COMMUNICATION FRAMEWORK 

When considering on which system we should implement our
message reduction techniques, we decided to implement a new
open-source Pregel system in C/C++, called Pregel+, to avoid the
pitfalls of a JAVA-based system described in Section 2.2. Other
reasons for a new Pregel implementation include: (1) GPS does not
perform sender-side message combining, while our work studies
effective message reduction techniques in a system that adheres
to Pregel’s framework, where message combining is supported;
(2) Giraph has been shown to have inferior performance in recent
performance evaluation of graph-parallel systems [20]
and also in our experiments; (3) other existing graph computing
systems are also not suitable as described in Sections 2.3 and ??.

We first introduce the basic communication framework of Pregel+.
Our two new message reduction techniques to be introduced in
Sections 5 and 6 further extend the basic communication framework.

We use the term “worker” to represent a computing unit, which 
can be a machine or a thread/process in a machine. For ease of 
discussion, we assume that each machine runs only one worker but 
the concepts can be straightforwardly generalized. 

In Pregel+, each worker is simply an MPI (Message Passing
Interface) process and communications among different processes are
implemented using MPI’s communication primitives. Each worker
maintains a message channel, $Ch_{msg}$, for exchanging the
vertex-to-vertex messages. In the compute() function, if a vertex
sends a message msg to a target vertex $v_{tgt}$, the message is
simply added to $Ch_{msg}$. Like in Google’s Pregel, messages in
$Ch_{msg}$ are sent to the target workers in batches before the
next superstep begins. Note that if a message msg is sent from
worker $M_i$ to vertex $v_{tgt}$ in worker $M_j$, the ID of the
target $v_{tgt}$ should be sent along with msg, so that when $M_j$
receives msg, it knows which vertex msg should be directed to.

The operation of the message channel $Ch_{msg}$ is directly related
to the communication cost and hence affects the overall
performance of the system. We tested different ways of implementing
$Ch_{msg}$, and the most efficient one is presented in Figure 7. We
assume that a worker maintains N vertices,
$\{v_1, v_2, \ldots, v_N\}$. The message channel $Ch_{msg}$
associates each vertex $v_i$ with an incoming message buffer
$I_i$. When an incoming message $msg_i$ directed to vertex $v_i$
arrives, $Ch_{msg}$ looks up a hash table $T_{in}$ for the incoming
message buffer $I_i$ using $v_i$’s ID. It then appends $msg_i$ to
the end of $I_i$. The lookup table is static unless graph mutation
occurs, in which case updates to $T_{in}$ may be required. Once all
incoming messages are processed, compute() is called for each
active vertex $v_i$ with the messages in $I_i$ as the input.

Figure 7: Illustration of Message Channel, $Ch_{msg}$ (In-Msg Buffer, Vertices, Out-Msg Buffer)

Figure 9: Mirroring v.s. Message Combining. (a) Adjacency Lists; (b) Messages to $v_2$

Figure 8: Illustration of Mirroring. (a) Pregel's Message Passing; (b) Message Passing With Mirrors

THEOREM 1. Let $d(v)$ be the degree of a vertex $v$ and $M$ be the
number of machines. Suppose that $v$ is to deliver a message $a(v)$
to all its neighbors in one superstep. If mirroring is applied on
$v$, then the total number of messages sent by $v$ in order to
deliver $a(v)$ to all its neighbors is bounded by $\min\{M, d(v)\}$.

PROOF. The proof follows directly from the fact that $v$ only needs
to send one message $a(v)$ to each of its mirrors in other machines
and there are at most $\min\{M, d(v)\}$ mirrors of $v$. □

A worker also maintains $M$ outgoing message buffers (where $M$ is
the number of workers), one for each worker $M_j$ in the cluster,
denoted by $O_j$. In compute(), a vertex $v_i$ may send a message
$msg$ to another vertex with ID $tgt$. Let $hash(\cdot)$ be the hash
function that computes the worker ID of a vertex from its vertex ID;
then the target vertex is in worker $M_{hash(tgt)}$. Thus, $msg$
(along with $tgt$) is appended to the end of the buffer
$O_{hash(tgt)}$. Messages in each buffer $O_j$ are sent to worker
$M_j$ in batch. If a combiner is used, the messages in a buffer
$O_j$ are first grouped (sorted) by target vertex IDs, and messages
in each group are combined into one message using the combiner logic
before sending.
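The outgoing side of the channel described above can be sketched as follows. This is a minimal illustration, not the Pregel+ implementation (which is written in C/C++ on MPI): the class name `OutgoingBuffers`, the modulo-based `worker_of` stand-in for $hash(\cdot)$, and the `flush` method are all assumptions made for the example.

```python
def worker_of(vertex_id, num_workers):
    # Hypothetical stand-in for hash(.): maps a vertex ID to a worker ID.
    return vertex_id % num_workers


class OutgoingBuffers:
    """One outgoing buffer O_j per worker, with an optional combiner."""

    def __init__(self, num_workers, combiner=None):
        self.num_workers = num_workers
        self.combiner = combiner  # commutative, associative binary function
        self.buffers = [[] for _ in range(num_workers)]

    def send_msg(self, tgt, msg):
        # Append (tgt, msg) to the buffer of the worker that owns tgt.
        self.buffers[worker_of(tgt, self.num_workers)].append((tgt, msg))

    def flush(self, j):
        # Return the batch destined for worker j and clear the buffer.
        batch, self.buffers[j] = self.buffers[j], []
        if self.combiner is None:
            return batch
        combined = {}
        for tgt, msg in batch:
            # Group by target vertex ID and fold each group into one message.
            combined[tgt] = self.combiner(combined[tgt], msg) if tgt in combined else msg
        return sorted(combined.items())
```

With a sum combiner, two messages addressed to the same target leave the worker as a single combined message, while messages to distinct targets are untouched.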

5. THE MIRRORING TECHNIQUE 

The mirroring technique is designed to eliminate bottleneck vertices
caused by high vertex degree.

Given a high-degree vertex $v$, we construct a mirror for $v$ in any
worker in which some of $v$'s neighbors reside. When $v$ needs to
send a message, e.g., the value of its attribute, $a(v)$, to its
neighbors, $v$ sends $a(v)$ to its mirrors. Then, each mirror
forwards $a(v)$ to the neighbors of $v$ that reside in the same
local worker as the mirror, without any message passing.

Figure 8 illustrates the idea of mirroring. Assume that $u_i$ is a
high-degree vertex residing in worker machine $M_1$, and $u_i$ has
neighbors $\{v_1, v_2, \ldots, v_j\}$ residing in machine $M_2$ and
neighbors $\{w_1, w_2, \ldots, w_k\}$ residing in machine $M_3$.
Suppose that $u_i$ needs to send a message $a(u_i)$ to the $j$
neighbors in $M_2$ and $k$ neighbors in $M_3$. Figure 8(a) shows how
$u_i$ sends $a(u_i)$ to its neighbors in $M_2$ and $M_3$ using
Pregel's vertex-to-vertex message passing. In total, $(j + k)$
messages are sent, one for each neighbor. To apply mirroring, we
construct a mirror for $u_i$ in $M_2$ and $M_3$, as shown by the two
squares (with label $u_i$) in Figure 8(b). In this way, as
illustrated in Figure 8(b), $u_i$ only needs to send $a(u_i)$ to the
two mirrors in $M_2$ and $M_3$. Then, each mirror forwards $a(u_i)$
to $u_i$'s neighbors locally in $M_2$ and $M_3$ without any network
communication. In total, only two messages are sent through the
network, which not only tremendously reduces the communication cost,
but also eliminates the imbalanced communication load caused by
$u_i$.
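The message counts in this example can be checked with a small sketch. The function name `network_messages` and the modulo placement of vertices on workers are assumptions made for illustration only; they are not part of Pregel+.

```python
def worker_of(vertex_id, num_workers):
    # Hypothetical stand-in for hash(.): maps a vertex ID to a worker ID.
    return vertex_id % num_workers


def network_messages(sender, neighbors, num_workers, mirrored):
    """Count messages that cross the network when sender broadcasts a(v)."""
    home = worker_of(sender, num_workers)
    remote = [u for u in neighbors if worker_of(u, num_workers) != home]
    if mirrored:
        # One message per remote worker that hosts a mirror of the sender.
        return len({worker_of(u, num_workers) for u in remote})
    # Plain vertex-to-vertex passing: one message per remote neighbor.
    return len(remote)
```

For a sender on worker 0 with four neighbors on worker 1 and three on worker 2, the plain scheme sends $j + k = 7$ messages while mirroring sends 2, which is within the $\min\{M, d(v)\}$ bound of Theorem 1.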

We formalize the effectiveness of mirroring for message reduction
in Theorem 1.


Mirroring Threshold. The mirroring technique is transparent to
programmers. But we can allow users to specify a mirroring threshold
$\tau$ such that mirroring is applied to a vertex $v$ only if
$d(v) \geq \tau$ (we will see shortly that $\tau$ can be
automatically set by a cost model following the result of
Theorem 2). If a vertex has degree less than $\tau$, it sends
messages through the normal message channel $Ch_{msg}$ as usual.
Otherwise, the vertex only sends messages to its mirrors, and we
call this message channel the mirroring message channel, or
$Ch_{mir}$ for short. In a nutshell, a message is sent either
through $Ch_{msg}$ or $Ch_{mir}$, depending on the degree of the
sending vertex.

Figure 9 illustrates the concepts of $Ch_{msg}$ and $Ch_{mir}$,
where we only consider the message passing between two machines
$M_1$ and $M_2$. The adjacency lists of vertices $u_1$, $u_2$, $u_3$
and $u_4$ in $M_1$ are shown in Figure 9(a), and we consider how
they send messages to their common neighbor $v_2$ residing in
machine $M_2$. Assume that $\tau = 3$; then, as Figure 9(b) shows,
$u_1$, $u_2$ and $u_3$ send their messages, $a(u_1)$, $a(u_2)$ and
$a(u_3)$, through $Ch_{msg}$, while $u_4$ sends its message $a(u_4)$
through $Ch_{mir}$.

Mirroring v.s. Message Combining. Now let us assume that the
messages are to be applied with commutative and associative
operations at the receivers' side, e.g., the message values are to
be summed up as in PageRank computation. In this case, a combiner
can be applied on the message channel $Ch_{msg}$. However, the
receiver-centric message combining is not applicable to the
sender-centric channel $Ch_{mir}$. For example, in Figure 9(b), when
$u_4$ in $M_1$ sends $a(u_4)$ to its mirror in $M_2$, $u_4$ does not
need to know the receivers (i.e., $v_1$, $v_2$, $v_3$ and $v_4$);
thus, its message to $v_2$ cannot be combined with those messages
from $u_1$, $u_2$ and $u_3$ that are also to be sent to $v_2$. In
fact, $u_4$ only holds a list of the machines that contain $u_4$'s
neighbors, i.e., $\{M_2\}$ in this example, and $u_4$'s neighbors
$v_1$, $v_2$, $v_3$ and $v_4$ that are local to $M_2$ are connected
by $u_4$'s mirror in $M_2$.

It may appear that $u_4$'s message to its mirror is wasted, because
if we combine $u_4$'s message with those messages from $u_1$, $u_2$
and $u_3$, then we do not need to send it through $Ch_{mir}$.
However, we note that a high-degree vertex like $u_4$ often has many
neighbors in another worker machine, e.g., $v_1$, $v_3$ and $v_4$ in
addition to $v_2$ in this example, and the message is not wasted
since the message is also forwarded to $v_3$ and $v_4$, which are
not the neighbors of any other vertex in $M_1$.



































































Choice of Mirroring Threshold. The above discussion shows that
there are cases where mirroring is useful, but it does not give any
formal guideline as to when exactly mirroring should be applied.
To this end, we conduct a theoretical analysis below on the
interplay between mirroring and message combining. Our result shows
that mirroring is effective even when a message combiner is used.


THEOREM 2. Given a graph $G = (V, E)$ with $n = |V|$ vertices and
$m = |E|$ edges, we assume that the vertex set is evenly partitioned
among $M$ machines (e.g., by hashing as in Pregel) and each machine
holds $n/M$ vertices. We further assume that the neighbors of a
vertex in $G$ are randomly chosen among $V$, and the average degree
$deg_{avg} = m/n$ is a constant. Then, mirroring should be applied
to a vertex $v$ if $v$'s degree is at least
$M \cdot \exp\{deg_{avg}/M\}$.

PROOF. Consider a machine $M_i$ that contains a set of $n/M$
vertices, $V_i = \{v_1, v_2, \ldots, v_{n/M}\}$, where each vertex
$v_j$ has $\ell_j$ neighbors for $1 \leq j \leq n/M$. Let us focus
on a specific vertex $v_j$ in $M_i$, and infer how large $\ell_j$
should be so that applying mirroring on $v_j$ can reduce the overall
communication even when a combiner is used.

Consider an application where all vertices send messages to all
their neighbors in each superstep, such as in PageRank computation.
Further consider a vertex $u \in \Gamma_{out}(v_j)$. If another
vertex $v_k \in V_i \setminus \{v_j\}$ sends messages through
$Ch_{msg}$ and $v_k$ also has $u$ as its neighbor, then $v_j$'s
message to $u$ is wasted since it can be combined with $v_k$'s
message to $u$. We assume the worst case where all vertices in
$V_i \setminus \{v_j\}$ send messages through $Ch_{msg}$. Since the
neighbors of a vertex in $G$ are randomly chosen among $V$, we have

$$\Pr\{u \in \Gamma_{out}(v_k)\} = \ell_k / n,$$

and therefore,

$$\Pr\{v_j\text{'s message to } u \text{ is not wasted}\}
  = \prod_{v_k \in V_i \setminus \{v_j\}} \Pr\{u \notin \Gamma_{out}(v_k)\}
  = \prod_{v_k \in V_i \setminus \{v_j\}} \left(1 - \frac{\ell_k}{n}\right).$$

We regard each $\ell_k$ as a random variable whose value is chosen
independently from a degree distribution (e.g., power-law degree
distribution) with expectation $E[\ell_k] = m/n = deg_{avg}$. Then,
the expectation of the above equation is given by

$$E\left[\prod_{v_k \in V_i \setminus \{v_j\}} \left(1 - \frac{\ell_k}{n}\right)\right]
  = \prod_{v_k \in V_i \setminus \{v_j\}} \left(1 - \frac{E[\ell_k]}{n}\right)
  = \left(1 - \frac{deg_{avg}}{n}\right)^{n/M - 1}.$$

For large graphs, we have

$$\Pr\{v_j\text{'s message to } u \text{ is not wasted}\}
  \approx \lim_{n \to \infty} \left(1 - \frac{deg_{avg}}{n}\right)^{n/M}
  = \exp\left\{-\frac{deg_{avg}}{M}\right\},$$

where the last step is derived from the fact that
$\lim_{n \to \infty} (1 - 1/n)^n = e^{-1}$.

According to the above discussion, the expected number of $v_j$'s
neighbors that are not the neighbors of any other vertex(es) in
$M_i$ is equal to $\ell_j \cdot \exp\{-deg_{avg}/M\}$. In other
words, if mirroring is not used, $v_j$ needs to send at least
$\ell_j \cdot \exp\{-deg_{avg}/M\}$ messages that are not wasted. On
the other hand, if mirroring is used, $v_j$ sends at most $M$
messages, one to each mirror. Therefore, mirroring reduces the
number of messages if $\ell_j \cdot \exp\{-deg_{avg}/M\} > M$, or
equivalently, $\ell_j > M \cdot \exp\{deg_{avg}/M\}$. To conclude,
choosing $\tau = M \cdot \exp\{deg_{avg}/M\}$ as the degree
threshold reduces the communication cost. □

Theorem 2 states that the choice of $\tau$ depends on the number of
workers, $M$, and the average vertex degree, $deg_{avg}$. A cluster
usually involves tens to hundreds of workers, while the average
degree $deg_{avg}$ of a large real-world graph is mostly below 50.
Consider the scenario where $M = 100$ and $deg_{avg} < 50$; then
$\tau < 100 e^{0.5} \approx 165$. This shows that mirroring is
effective even for vertices whose degree is not very high. We remark
that Theorem 2 makes some simplifying assumptions (e.g., $G$ being a
random graph) for ease of analysis, which may not be accurate for a
real graph. However, our experiments in Section 7.1 show that
Theorem 2 is effective on real graphs.
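As a quick numerical check of the threshold formula in Theorem 2, the following sketch evaluates $\tau = M \cdot \exp\{deg_{avg}/M\}$. The function names are chosen for this example only, and the `>=` test in `should_mirror` is an assumption that follows the "at least" wording of the theorem.

```python
import math


def mirroring_threshold(num_workers, avg_degree):
    # tau = M * exp(deg_avg / M), as stated in Theorem 2.
    return num_workers * math.exp(avg_degree / num_workers)


def should_mirror(degree, num_workers, avg_degree):
    # Assumed rule: a vertex is mirrored when its degree reaches tau.
    return degree >= mirroring_threshold(num_workers, avg_degree)
```

For $M = 100$ and $deg_{avg} = 50$, `mirroring_threshold(100, 50)` evaluates to roughly 164.9, matching the value of about 165 quoted above.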

Mirror Construction. Pregel+ constructs mirrors for all vertices
$v$ with $|\Gamma_{out}(v)| \geq \tau$ after the input graph is
loaded and before the iterative computation, although mirror
construction can also be pre-computed offline like GraphLab's ghost
construction. Specifically, the neighbors in $v$'s adjacency list
$\Gamma_{out}(v)$ are grouped by the workers in which they reside.
Each group is defined as
$N_i = \{u \in \Gamma_{out}(v) \mid hash(u) = M_i\}$. Then, for each
group $N_i$, $v$ sends $(v; N_i)$ to worker $M_i$, and $M_i$
constructs a mirror of $v$ with the adjacency list $N_i$ locally in
$M_i$. Each vertex $v_j \in N_i$ also stores the address of $v_j$'s
incoming message buffer $I_j$ so that messages can be directly
forwarded to $v_j$ by $v$'s mirror in $M_i$.

During graph computation, a vertex $v$ sends message $(v, a(v))$ to
its mirror in worker $M_i$. On receiving the message, $M_i$ looks up
$v$'s mirror from a hash table using $v$'s ID (similar to $T_{in}$
described in Section 4). The message value $a(v)$ is then forwarded
to the incoming message buffers of $v$'s neighbors locally in $M_i$.
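The two phases above (grouping neighbors into per-worker mirrors, then forwarding a received value locally) can be sketched as follows. The `Worker` class, its method names, and the modulo-based `worker_of` are hypothetical and only illustrate the idea; Pregel+ stores buffer addresses rather than looking buffers up by vertex ID.

```python
from collections import defaultdict


def worker_of(vertex_id, num_workers):
    # Hypothetical stand-in for hash(.): maps a vertex ID to a worker ID.
    return vertex_id % num_workers


def build_mirrors(out_neighbors, num_workers):
    """Group a vertex's out-neighbors into N_i, one group per worker i."""
    groups = defaultdict(list)
    for u in out_neighbors:
        groups[worker_of(u, num_workers)].append(u)
    return dict(groups)


class Worker:
    def __init__(self):
        self.mirrors = {}               # mirrored vertex ID -> local neighbors N_i
        self.inbox = defaultdict(list)  # local vertex ID -> incoming buffer

    def add_mirror(self, v, local_neighbors):
        self.mirrors[v] = list(local_neighbors)

    def receive_mirror_msg(self, v, value):
        # Look up v's mirror and forward a(v) to each local neighbor's buffer.
        for u in self.mirrors[v]:
            self.inbox[u].append(value)
```

A single network message to a worker is thereby fanned out locally to every neighbor of $v$ on that worker.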

Handling Edge Fields. There are some minor changes to Pregel's
programming interface for applying mirroring. In Pregel's interface,
a vertex calls send_msg(tgt, msg) to send an arbitrary message msg
to a target vertex tgt. With mirroring, a vertex $v$ sends a message
containing the value of its attribute $a(v)$ to all its neighbors by
calling broadcast($a(v)$) instead of calling send_msg($u$, $a(v)$)
for each neighbor $u \in \Gamma_{out}(v)$.

Consider the algorithms described in Section 3. For PageRank, a
vertex $v$ simply calls broadcast($pr(v)/|\Gamma_{out}(v)|$); while
for Hash-Min, $v$ calls broadcast($min(v)$).

However, there are applications where the message value is not only
decided by the sender vertex $v$'s state, but also by the edge that
the message is sent along. For example, in Pregel's algorithm for
single-source shortest path (SSSP) computation [13], a vertex sends
$(d(v) + \ell(v, u))$ to each neighbor $u \in \Gamma_{out}(v)$,
where $d(v)$ is an attribute of $v$ estimating the distance from the
source, and $\ell(v, u)$ is an attribute of its out-edge $(v, u)$
indicating the edge length.

To support applications like SSSP, Pregel+ requires that each edge
object supports a function relay(msg), which specifies how to update
the value of msg before msg is added to the incoming message buffer
$I_i$ of the target vertex $v_i$. If msg is sent through $Ch_{msg}$,
relay(msg) is called on the sender side before sending. If msg is
sent through $Ch_{mir}$, relay(msg) is called on the receiver side
when the mirror forwards msg to each local neighbor (as the edge
field is maintained by the mirror). For example, in Figure 9,
relay(msg) is called when msg is passed along a dashed arrow.

By default, relay(msg) does not change the value of msg. To support
SSSP, a vertex $v$ calls broadcast($d(v)$) in compute(), and
meanwhile, the function relay(msg) is overloaded to add the edge
length $\ell(v, u)$ to msg, which updates the value of msg to the
required value $(d(v) + \ell(v, u))$.
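The role of relay(msg) can be illustrated with a short sketch. The class names `Edge` and `WeightedEdge` and the helper `forward_from_mirror` are invented for this example and do not reflect the actual Pregel+ class layout.

```python
class Edge:
    def __init__(self, target, length=0.0):
        self.target = target
        self.length = length

    def relay(self, msg):
        # Default behavior: the message value is passed on unchanged.
        return msg


class WeightedEdge(Edge):
    def relay(self, msg):
        # SSSP-style override: add the edge length to the broadcast d(v).
        return msg + self.length


def forward_from_mirror(edges, value):
    """Receiver-side forwarding: apply each edge's relay to the value."""
    return {edge.target: edge.relay(value) for edge in edges}
```

Broadcasting $d(v) = 3.0$ over edges of length 2.0 and 5.5 delivers 5.0 and 8.5 to the two targets, whereas the default `Edge` delivers 3.0 to both.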

Summary of Contributions. GPS does not use message combining, and
therefore, its LALP technique is not as effective as our mirroring
technique, which is reinforced with a message combiner. GraphLab's
ghost vertex technique creates mirrors for all vertices regardless
of the vertex degree, and thus it is also not as effective as our
mirroring technique. As far as we know, this is the first work that
considers the integration of vertex mirroring and message combining
in Pregel's computing model. In addition, we identified the tradeoff
between vertex mirroring and message combining in message reduction,
and provided a cost model to automatically select vertices for
mirroring so as to minimize the number of messages. As we shall see
in our experiments in Section 7.1, the mirroring threshold computed
by our cost model in Theorem 2 achieves near-optimal performance. We
also cope with the case where the message value depends on the edge
field, which is not supported by GPS's LALP technique.


6. THE REQUEST-RESPOND PARADIGM 

In Sections 1, 3.4 and 3.5, we have shown that bottleneck vertices
can be generated by algorithm logic even if the input graph has no
high-degree vertices. For handling such bottleneck vertices, the
mirroring technique of Section 5 is not effective. To this end, we
design our second message reduction technique, which extends the
basic Pregel framework with a new request-respond functionality.

We illustrate the concept using the algorithms described in
Section 3. Using the request-respond API, attribute broadcast in
Section 3.1 is straightforward to implement: in superstep 1, each
vertex $v$ sends requests to each neighbor $u \in \Gamma_{out}(v)$
for $a(u)$; in superstep 2, the vertex $v$ simply obtains $a(u)$
responded by each neighbor $u$, and constructs $\Gamma_{out}(v)$.
Similarly, for the S-V algorithm in Section 3.4, when a vertex $v$
needs to obtain $D[w]$ from vertex $w = D[v]$, it simply sends a
request to $w$ so that $D[w]$ can be used in the next superstep; for
the MSF algorithm in Section 3.5, a vertex $v$ simply sends a
request to its supervertex $D[v] = s$ so that $D[s]$ can be used to
update $D[v]$ in the next superstep.

Request-Respond Message Channel. We now explain in detail how
Pregel+ supports the request-respond API. The request-respond
paradigm supports all the functionality of Pregel. In addition, it
supplements the vertex-to-vertex message channel $Ch_{msg}$ with a
request-respond message channel, denoted by $Ch_{req}$.

Figure 10 illustrates how requests and responses are exchanged
between two machines $M_i$ and $M_j$ through $Ch_{req}$.
Specifically, each machine maintains $M$ request sets, where $M$ is
the number of machines, and each request set $S^{to}_k$ stores the
requests to vertices in machine $M_k$. In a superstep, a vertex $v$
in machine $M_j$ may call request($u$) in its compute() function to
send a request to vertex $u$ for its attribute value $a(u)$ (which
will be used in the next superstep). Let $hash(u) = i$; then the
requested vertex $u$ is in machine $M_i$, and hence $u$ is added to
the request set $S^{to}_i$ of $M_j$. Although many vertices in $M_j$
may send requests to $u$, only one request to $u$ will be sent from
$M_j$ to $M_i$ since $S^{to}_i$ is a (hash) set that eliminates
redundant elements.

After compute() is called for all active vertices, the
vertex-to-vertex messages are first exchanged through $Ch_{msg}$.
Then, each machine sends each request set $S^{to}_k$ to machine
$M_k$. After the requests are exchanged, each machine receives $M$
request sets, where set $S^{from}_k$ stores the requests sent from
machine $M_k$. In the example shown in Figure 10, $u$ is contained
in the set $S^{from}_j$ in machine $M_i$, since vertex $v$ in
machine $M_j$ sent a request to $u$.

Figure 10: Illustration of request-respond paradigm

Then, a response set $R^{to}_k$ is constructed for each request set
$S^{from}_k$ received, which is to be sent back to machine $M_k$. In
our example, the requested vertex, $u \in S^{from}_j$, calls a
user-specified function respond() to return its specified attribute
$a(u)$, and adds the entry $(u, a(u))$ to the response set
$R^{to}_j$.

Once the response sets are exchanged, each machine constructs a hash
table from the received entries. In the example shown in Figure 10,
the entry $(u, a(u))$ is received by machine $M_j$ since it is in
the response set $R^{to}_j$ in machine $M_i$. The hash table is
available for the next superstep, where vertices can access their
requested value in their compute() function. In our example, vertex
$v$ in machine $M_j$ may call get_resp($u$) in the next superstep,
which looks up $u$'s attribute $a(u)$ from the hash table.
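The request, response and lookup steps described above can be simulated in a single process as follows. This is only a sketch under stated assumptions: the `Machine` class, the `exchange` function, and the modulo placement of vertices are invented for illustration, and the real system exchanges the sets over MPI between supersteps.

```python
class Machine:
    def __init__(self, machine_id, num_machines, attributes):
        self.id = machine_id
        self.num_machines = num_machines
        self.attributes = dict(attributes)  # local vertex ID -> a(u)
        # One request set per destination machine; sets drop duplicates.
        self.request_sets = [set() for _ in range(num_machines)]
        self.response_table = {}            # requested vertex ID -> a(u)

    def request(self, u):
        # Assumed placement: vertex u lives on machine u mod M.
        self.request_sets[u % self.num_machines].add(u)

    def respond(self, u):
        # User-specified in the paper; here it just returns the attribute.
        return self.attributes[u]

    def get_resp(self, u):
        return self.response_table[u]


def exchange(machines):
    """Exchange requests and responses; return the network message count."""
    network_messages = 0
    for src in machines:
        for dst in machines:
            requested = src.request_sets[dst.id]
            for u in requested:
                src.response_table[u] = dst.respond(u)
            if src is not dst:
                # One request and one response per distinct requested vertex.
                network_messages += 2 * len(requested)
        src.request_sets = [set() for _ in machines]
    return network_messages
```

If three vertices on one machine all request the same remote vertex, the request set holds it once, so only one request and one response cross the network for it.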

The following theorem shows the effectiveness of the request- 
respond paradigm for message reduction. 


THEOREM 3. Let $\{v_1, v_2, \ldots, v_\ell\}$ be the set of
requesters that request the attribute $a(u)$ from a vertex $u$.
Then, the request-respond paradigm reduces the total number of
messages from $2\ell$ in Pregel's vertex-to-vertex message passing
framework to at most $2\min(M, \ell)$, where $M$ is the number of
machines.

PROOF. The proof follows directly from the fact that each machine
sends at most 1 request to $u$ even though there may be more than 1
requester in that machine, and that at most 1 response from $u$ is
sent to each machine that makes a request to $u$, and that there are
at most $\min(M, \ell)$ machines that contain a requester. □
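A small worked count makes the bound concrete. Both function names below are assumptions for this sketch; each takes the list of machines on which the $\ell$ requesters of a single vertex $u$ reside.

```python
def messages_vertex_to_vertex(requester_machines):
    # One request plus one response per requester: 2 * l.
    return 2 * len(requester_machines)


def messages_request_respond(requester_machines):
    # One request plus one response per distinct requesting machine.
    return 2 * len(set(requester_machines))
```

With $\ell = 6$ requesters spread over machines 0, 0, 1, 1, 1 and 2 in a cluster of $M = 4$ machines, vertex-to-vertex passing uses 12 messages while request-respond uses 6, within the bound $2\min(M, \ell) = 8$.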


In the worst case, the request-respond paradigm uses the same number
of messages as Pregel's vertex-to-vertex message passing. But in
practice, many Pregel algorithms (e.g., those described in
Sections 3.4 and 3.5) have bottleneck vertices with a large number
of requesters, leading to imbalanced workload and long elapsed
running time. In such cases, our request-respond paradigm
effectively bounds the number of messages to the number of machines
containing the requesters and eliminates the imbalanced workload.

Explicit Responding. In the above discussion, a vertex $v$ simply
calls request($u$) in one superstep, and it can then call
get_resp($u$) in the next superstep to get $a(u)$. All the
operations including request exchange, response set construction,
response exchange, and response table construction are performed by
Pregel+ automatically and are thus transparent to users. We name the
above process implicit responding, where a responder does not know
the requester until a request is received.

When a responder $w$ knows its requesters $v$, $w$ can explicitly
call respond($v$) in compute(), which adds $(w, w.respond())$ to the
response set $R^{to}_j$ where $j = hash(v)$. This process is also
illustrated in Figure 10. Explicit responding is more cost-efficient
since there is no need for request exchange and response set
construction.

Data         Type        |V|          |E|            AVG Deg  Max Deg
WebUK        directed    133,633,040  5,507,679,822  41.21    22,429
LiveJournal  directed    10,690,276   224,614,770    21.01    1,053,676
Twitter      directed    52,579,682   1,963,263,821  37.34    779,958
BTC          undirected  164,732,473  772,822,094    4.69     1,637,619
USA Road     undirected  23,947,347   58,333,344     2.44     9

Figure 11: Datasets (M = million)

Explicit responding is useful in many applications. For example, to
compute PageRank on an undirected graph, a vertex $v$ can simply
call respond($u$) for each $u \in \Gamma(v)$ to push
$a(v) = pr(v)/|\Gamma(v)|$ to $v$'s neighbors; this is because in
the next superstep, vertex $u$ knows its neighbors $\Gamma(u)$, and
can thus collect their responses. Similarly, in attribute broadcast,
if the input graph is undirected, each vertex $v$ can simply push
its attribute $a(v)$ to its neighbors. Note that data pushing by
explicit responding requires fewer messages than Pregel's
vertex-to-vertex message passing, since responses are sent to
machines (more precisely, their response tables) rather than
individual vertices.

Programming Interface. Pregel+ extends the vertex class in Pregel's
interface by requiring users to specify an additional template
argument <R>, which indicates the type of the attribute value that
a vertex responds.

In compute(), a vertex can either pull data from another vertex $v$
by calling request($v$), or push data to $v$ by calling
respond($v$). The attribute value that a vertex returns is defined
by a user-specified abstract function respond(), which returns a
value of type <R>. Like compute(), one may program respond() to
return different attributes of a vertex in different supersteps
according to the algorithm logic of the specific application.
Finally, a vertex may call get_resp($v$) in compute() to get the
attribute of $v$, if it is pushed into the response table in the
previous superstep.

7. EXPERIMENTAL RESULTS 

We now evaluate the effectiveness of our message reduction
techniques. We ran our experiments on a cluster of 16 machines, each
with 24 processors (two Intel Xeon E5-2620 CPUs) and 48GB RAM. One
machine is used as the master, while the other 15 machines act as
slaves. The connectivity between any pair of nodes in the cluster is
1 Gbps.

We used five real-world datasets, as shown in Figure 11:
(1) WebUK^2: a web graph generated by combining twelve monthly
snapshots of the .uk domain collected for the DELIS project;
(2) LiveJournal (LJ)^3: a bipartite network of LiveJournal users and
their group memberships; (3) Twitter^4: the Twitter who-follows-who
network based on a snapshot taken in 2009; (4) BTC^5: a semantic
graph converted from the Billion Triple Challenge 2009 RDF dataset;
(5) USA^6: the USA road network.

LJ, Twitter and BTC have skewed degree distribution; WebUK, LJ and
Twitter have relatively high average degree; USA and WebUK have a
large diameter.

Pregel+ Implementation. Pregel+ is implemented in C/C++ as a group
of header files, and users only need to include the necessary base
classes and implement the application logic in their subclasses.
Pregel+ communicates with HDFS through libhdfs, a JNI-based C API
for HDFS. Each worker is simply an MPI process and communications
are implemented using MPI communication primitives. While one may
deploy Pregel+ with any Hadoop and MPI version, we use Hadoop 1.2.1
and MPICH 3.0.4 in our experiments. All programs are compiled using
GCC 4.4.7 with the -O2 option enabled.

2 http://law.di.unimi.it/webdata/uk-union-2006-06-2007-05
3 http://konect.uni-koblenz.de/networks/livejournal-groupmemberships
4 http://konect.uni-koblenz.de/networks/twitter_mpi
5 http://km.aifb.kit.edu/projects/btc-2009/
6 http://www.dis.uniroma1.it/challenge9/download.shtml

All the system source code, as well as the source code of the
algorithms discussed in this paper, can be found at
http://www.cse.cuhk.edu.hk/pregelplus

7.1 Effectiveness of Mirroring 

Figure 12 reports the performance gain by mirroring. We measure the
gain by comparing with (1) Pregel+ with neither mirroring nor
combiner, denoted by Pregel-noMC; (2) Pregel+ with combiner but
without mirroring, denoted by Pregel-noM; and (3) GPS [18] with and
without LALP. The request-respond technique is not applied in
Pregel+ for this set of experiments. As a reference, we also report
the performance of Giraph 1.0.0 (with combiner) and GraphLab 2.2
(which includes PowerGraph [5]).

We test the mirroring thresholds 1, 10, 100, 1000, and the one
automatically set by the cost model given by Theorem 2 (which is
199, 165, 62, 126, for WebUK, Twitter, LJ, BTC, respectively). But
for the USA road network, its maximum vertex degree is only 9 and
thus we do not apply mirroring with large thresholds. For GPS, we
follow [18] and fix the threshold of LALP as 100. This is a
reasonable choice, since [18] reports that this threshold achieves
good performance in general, and we find that the best performance
after tuning the threshold is very close to the performance when the
threshold is 100. We also report the preprocessing time of
constructing mirrors for Pregel+ and that of LALP for GPS in rows
marked by "Preproc Time". We also report the number of messages sent
by Pregel+ and GPS (note that Giraph does not report the number of
messages, but the number should be the same as that of Pregel-noMC
and Pregel-noM; while GraphLab does not employ message passing).

We ran PageRank on the three directed graphs, and Hash-Min on the
two undirected graphs in Figure 11. For PageRank computation, we use
an aggregator to check whether every vertex changes its PageRank
value by less than 0.01 after each superstep, and terminate if so.
The computation takes 89, 89 and 96 supersteps on WebUK, Twitter and
LJ, respectively, before convergence. We do not run GraphLab in
asynchronous mode for PageRank, since its convergence condition is
different from the synchronous version and hence leads to different
PageRank results.
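The termination check described above can be sketched as a per-worker test followed by a logical AND across workers. The function names and the dictionary representation of PageRank values are assumptions for this example, not the Pregel+ aggregator API.

```python
def local_converged(old_ranks, new_ranks, tolerance=0.01):
    # True if every local vertex changed its PageRank by less than tolerance.
    return all(abs(new_ranks[v] - old_ranks[v]) < tolerance for v in new_ranks)


def aggregate_and(worker_flags):
    # Global decision: terminate only if all workers report convergence.
    return all(worker_flags)
```

A superstep loop would call `local_converged` on each worker and stop once `aggregate_and` over the collected flags returns True.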

Mirroring in Pregel+. As Figure 12 shows, mirroring significantly
improves the performance of Pregel-noM, in terms of the reduction in
both running time and message number. The improvement is
particularly obvious for the graphs Twitter, LJ, and BTC, which have
highly skewed degree distribution. Thus, the result also
demonstrates the effectiveness of mirroring in workload balancing.

Mirroring is not so effective for PageRank on WebUK, for which
Pregel-noM has the best performance. The number of messages is only
slightly decreased when the mirroring threshold $\tau = 1000$, and
yet it is still slower than Pregel-noM. This is because messages
sent through $Ch_{mir}$ are intercepted by mirrors, which incurs
additional cost. Since the degree of the majority of the vertices in
WebUK is not very high, mirroring does not significantly reduce the
number of messages, and thus, the additional cost of $Ch_{mir}$ is
not paid off.

The results also show that the mirroring threshold given by our cost
model achieves either the best performance, or close to the
performance of the best threshold tested. The one-off preprocessing
time required to construct the mirrors is also short compared with
the computation time.

PageRank on WebUK
  Comput Time: 2669*, 7732, 5603, 5561, 3475, 2784, 2935, 3909, 4020, 4834, 4262
  Preproc Time: -, 162.29, 143.00, 46.68, 32.79, 26.66, -, 663.34
  # of Msgs: 120107, 490184, 319614, 314212, 168317, 119889*, 134734, 487285, 377687

PageRank on Twitter
  Comput Time: 1575, 3131, 1621, 1648, 1177, 1381, 1048, 1343, 750.11*, 1567, 1762
  Preproc Time: -, 40.74, 41.34, 24.13, 8.63, 14.31, -, 74.95, -
  # of Msgs: 62276, 174730, 68430, 65770, 40980, 48616, 38873*, 174730, 78904

PageRank on LJ
  Comput Time: 316.26, 541.53, 251.98, 255.26, 212.35, 243.32, 216.05, 316.45, 197.28*, 312, 662
  Preproc Time: 9.95, 7.72, 3.75, 1.07, 3.94, -, 8.17
  # of Msgs: 6429, 21563, 5949, 3949*, 4209, 5162, 4359, 21563, 9665

Hash-Min on BTC
  Comput Time: 26.97, 44.28, 29.95, 15.53, 9.55*, 10.69, 9.85, 37.99, 33.00, 93, 83, 155
  Preproc Time: -, 20.74, 6.63, 5.92, 5.56, 5.41, -, 3.52
  # of Msgs: 1189, 2419, 1294, 259.4, 126.1, 152.4, 122.5*, 1525, 716.4

Hash-Min on USA
  Comput Time: 546.86, 546.66, 542.69*, 1205, 5714, 2982, 627
  Preproc Time: -, 4.52, -, -, -
  # of Msgs: 8353, 8485, 8305, 8485

Figure 12: Effects of mirroring (*: best result; Comput/Preproc time: Computation/Preprocessing time in sec; # of Msgs: # of messages in millions)

Comparison with Other Systems. Figure 12 shows that Pregel+ without
mirroring (i.e., Pregel-noM) is already faster than both Giraph and
GraphLab, which verifies that our $Ch_{msg}$ implementation is
efficient, and thus the performance gain by mirroring is not an
over-claimed improvement gained over a slow implementation.

Compared with GPS, the reduction in both message number and running
time achieved by the integration of mirroring and combiner in
Pregel+ is significantly more than that achieved by LALP alone in
GPS, which can be observed from (1) Pregel+ with mirroring v.s.
Pregel-noMC, and (2) GPS with LALP v.s. GPS without LALP. In
contrast to the claim in [18] that message combining is not
effective, our result clearly demonstrates the benefits of
integrating mirroring and combiner, and hence highlights the
importance of our theoretical analysis on the tradeoff between
mirroring and message combining (i.e., Theorem 2).

However, we notice that GPS is sometimes faster than Pregel+ even
though many more messages are exchanged. We found this hard to
explain and so we studied the code of GPS to explore the reason,
which we explain below. GPS requires that vertex IDs be contiguous
integers starting from 0, i.e., $0, 1, \cdots, |V|$; while other
systems allow vertex IDs to be of any user-specified type as long as
a hash function is provided (for calculating the ID of the worker
that a vertex resides in). As a result of the dense ID
representation, each worker in GPS simply maintains the incoming
message buffers of the vertices in an array, and when a worker
receives a message targeted at vertex $tgt$, it is put into $tgt$'s
incoming message buffer (i.e., $I_{tgt}$) whose position in the
array can be directly computed from $tgt$. On the contrary, systems
like Pregel+ and Giraph need to look up $I_{tgt}$ from a hash table
using key $tgt$, which has extra cost for each message exchanged.
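The difference between the two lookup schemes can be sketched as follows. The class names are invented for this example, and the layout in `DenseInbox` (vertex `tgt` stored at index `tgt // num_workers` on worker `tgt % num_workers`) is an assumption made for illustration, not the scheme GPS actually uses.

```python
class DenseInbox:
    """Array-backed buffers; requires contiguous integer vertex IDs."""

    def __init__(self, num_vertices, num_workers, worker_id):
        local_count = len(range(worker_id, num_vertices, num_workers))
        self.num_workers = num_workers
        self.buffers = [[] for _ in range(local_count)]

    def deliver(self, tgt, msg):
        # The buffer position is computed directly from the vertex ID.
        self.buffers[tgt // self.num_workers].append(msg)


class HashedInbox:
    """Hash-table-backed buffers; vertex IDs may be any hashable type."""

    def __init__(self, local_ids):
        self.buffers = {vid: [] for vid in local_ids}

    def deliver(self, tgt, msg):
        # Each delivery pays for one hash-table lookup.
        self.buffers[tgt].append(msg)
```

The hashed variant accepts keys such as integer pairs without relabeling, at the price of a lookup per message; the dense variant avoids the lookup but only works for contiguous integer IDs.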

| Algorithm on dataset | Metric | Pregel+ | ReqResp | Giraph | GPS |
|---|---|---|---|---|---|
| Attribute Broadcast on WebUK | Time | 178.4 s | 84.53 s | 169.28 s | 83.71 s* |
| | Msg # | 11015 M | 2699 M* | - | 10950 M |
| Attribute Broadcast on BTC | Time | 16.33 s | 13.31 s | 54.76 s | 8.69 s* |
| | Msg # | 772.8 M | 393.2 M* | - | 772.8 M |
| Attribute Broadcast on LJ | Time | 11.66 s | 9.09 s | 11.56 s | 6.43 s* |
| | Msg # | 449.2 M | 131.9 M* | - | 449.2 M |
| Attribute Broadcast on Twitter | Time | 59.84 s | 29.65 s* | 71.35 s | 29.93 s |
| | Msg # | 3927 M | 1396 M* | - | 3927 M |
| S-V on USA | Time | 261.93 s | 137.69 s* | 690 s | 189.77 s |
| | Msg # | 6598 M | 3789 M* | - | 6598 M |
| S-V on BTC | Time | 408.78 s | 190.55 s* | 1531 s | 286.22 s |
| | Msg # | 22393 M | 11232 M* | - | 22393 M |
| Minimum Spanning Forest on USA | Time | 19.95 s* | 25.20 s | 259.63 s | 85.15 s |
| | Msg # | 387.1 M | 162.2 M* | - | 387.1 M |
| Minimum Spanning Forest on BTC | Time | 83.36 s | 36.56 s* | 350.15 s | 209.92 s |
| | Msg # | 2424 M | 1110 M* | - | 2424 M |

Figure 13: Effects of the request-respond technique

We remark that there are good reasons to allow vertex IDs to
take an arbitrary type, rather than to hard-code them as contiguous
integers. For example, the Pregel algorithm in [24] for computing
bi-connected components constructs an auxiliary graph from the
input graph, and each vertex of the auxiliary graph corresponds to
an edge (u, v) of the input graph. While we can simply use an integer
pair as the vertex ID in Pregel+, using GPS requires extra effort from
programmers to relabel the vertices of the auxiliary graph with
contiguous integer IDs, which can be costly for a large graph. We note
that, if desired, one can easily implement GPS's dense vertex
ID representation in Pregel+ to further improve the performance
for certain algorithms, but this is not the focus of our work, which
studies message reduction techniques.
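
To illustrate why relabeling is extra work, the following sketch contrasts the two options for an auxiliary graph whose vertices are edges (u, v) of the input graph. The function names are hypothetical and the code is a single-machine simplification, not part of Pregel+ or GPS: with pair IDs, a hash function alone decides the worker, whereas contiguous IDs need a mapping built over all edges first.

```python
def worker_of(vertex_id, num_workers):
    """Hash-based placement; works for any hashable ID, including pairs."""
    return hash(vertex_id) % num_workers


def relabel(edges):
    """Assign contiguous integer IDs to auxiliary vertices, one per edge.

    Assumption: edges are undirected, so (u, v) and (v, u) are the same
    auxiliary vertex.
    """
    ids = {}
    for u, v in edges:
        key = (min(u, v), max(u, v))
        if key not in ids:
            ids[key] = len(ids)  # next unused contiguous ID
    return ids


edges = [(1, 2), (2, 1), (2, 3)]

# Option 1: use the pair itself as the vertex ID, no preprocessing.
placement = {e: worker_of(e, 4) for e in relabel(edges)}

# Option 2: build a global mapping first, then use the integers as IDs.
mapping = relabel(edges)  # {(1, 2): 0, (2, 3): 1}
```

In a distributed setting the mapping in option 2 has to be computed and shared across workers, which is the relabeling cost referred to above.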

7.2 Effectiveness of Request-Respond Technique 

Figure 13 reports the performance gained by the request-respond
technique. We test the three algorithms in Section 3 to which the
request-respond technique is applicable: attribute broadcast, S-V
and minimum spanning forest. We also include Giraph and GPS
as a reference. We do not include GraphLab since the algorithms
cannot be easily implemented in GraphLab (e.g., it is not clear how
a vertex v can communicate with a non-neighbor D[v] as in S-V
and minimum spanning forest).

The results show that Pregel+ with request-respond, denoted by
ReqResp, uses significantly fewer messages. For example, for
attribute broadcast on WebUK, ReqResp reduces the message number
from 11,015 million to only 2,699 million. ReqResp also records
the shortest running time except in a few cases where GPS is faster
due to the same reason given in Section 7.1. Another exception is
when computing minimum spanning forest on USA, where Pregel+
is faster without request-respond. This is because vertices in USA
have very low degree, rendering the request-respond technique
ineffective, and the additional computational overhead is not offset
by the reduction in message number.
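
The effect of fan-in on the savings can be seen in a small counting sketch. It assumes a simplified model that is not taken from the Pregel+ code: with plain message passing each requester sends one request and receives one response, while with request-respond each worker sends one request per distinct target vertex and receives one response for it. The function name and the placement map are hypothetical.

```python
def count_messages(requests, worker_of):
    """Count messages for a list of (requester, target) pairs.

    worker_of maps a vertex to the worker that holds it.
    Returns (plain, req_resp) under the simplified model above.
    """
    # Plain message passing: one request plus one response per requester.
    plain = 2 * len(requests)
    # Request-respond: requests from the same worker to the same target
    # are merged into a single request/response exchange.
    merged = {(worker_of[r], t) for r, t in requests}
    req_resp = 2 * len(merged)
    return plain, req_resp


workers = {v: v % 2 for v in range(6)}  # 6 vertices on 2 workers

# High fan-in: every vertex asks vertex 0 for its attribute.
high = [(v, 0) for v in range(6)]
print(count_messages(high, workers))  # (12, 4)

# Low fan-in: every vertex asks a different target.
low = [(v, (v + 1) % 6) for v in range(6)]
print(count_messages(low, workers))   # (12, 12)
```

When many requesters share a target, merging removes most messages; when targets are rarely shared, as in the low-degree case, nothing is merged and only the bookkeeping overhead remains.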

8. CONCLUSIONS 

We presented two techniques to reduce the amount of communication
and to eliminate skewed communication workload. The
first technique, mirroring, eliminates communication bottlenecks
caused by high vertex degree, and is transparent to programming.
The second technique is a new request-respond paradigm, which
eliminates bottlenecks caused by program logic, and simplifies the
programming of many Pregel algorithms. Our experiments on large
real-world graphs verified that our techniques are effective in
reducing the communication cost and overall computation time.